A Deep Dive into Course Descriptions: Using Quanteda to Identify Work-Based Learning Opportunities
Data@Urban Draft
This blog post is part two of a series on analyzing work-based learning opportunities in community colleges. In part one, we discussed how we used web scraping to gather course descriptions from community colleges in Florida. Now, we’ll delve into how we analyzed this data using the quanteda package in R.
1 Introduction
In our previous post, we detailed our journey of collecting course descriptions from Florida’s community colleges using web scraping techniques. We successfully compiled a comprehensive dataset, but the question remained: how do we make sense of this vast amount of text data? Enter quanteda, an R package designed for quantitative text analysis.
2 Getting Started with Quanteda
Quanteda, short for Quantitative Analysis of Textual Data, is a powerful tool for managing and analyzing text data in R. It offers a suite of functions for corpus management, creating document-feature matrices, analyzing keywords, and more. These functions are highly efficient and provide a consistent interface with support for multiple languages. While it operates as a standalone package, it also integrates seamlessly with extensions such as readtext, spacyr, and quanteda.textstats.
To install and load the required packages, we can use the librarian package for convenience.
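As a sketch, librarian::shelf() installs any missing packages and attaches them in a single call; the exact package list here is our guess at what the rest of the post uses, not the original code:

```r
# Install (if needed) and attach everything used below in one call
librarian::shelf(quanteda, readtext, tidyverse, glue, janitor)
```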
Next, we load our text data containing course descriptions, along with document-level metadata, using the readtext::readtext() function. We then perform some additional cleaning steps to standardize variable names and focus our analysis on active courses offered for credit.
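A sketch of the loading and cleaning step. The file path and the status/credit column names are our assumptions; the courses object (and the doc_id column that readtext creates) is what the joins later in the post rely on:

```r
# Read course descriptions plus document-level metadata; text_field names
# the column holding the description (hypothetical file and column names)
courses <- readtext::readtext("data/course_descriptions.csv",
                              text_field = "description")

courses <- courses %>%
  janitor::clean_names() %>%         # standardize variable names
  filter(status == "active",         # keep active courses...
         credit_type == "credit")    # ...that are offered for credit
```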
The first step in our analysis is to create a corpus, a collection of text documents, from our course descriptions. We can achieve this using the corpus() function in quanteda. Next, we extract the tokens in the corpus—usually words, but they can also be n-grams or multi-word expressions. The tokens function allows us to define what we mean by tokens and apply some rules to ignore elements such as punctuation and digits.
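These two steps can be sketched as follows, assuming the courses object from the loading step (the option names are quanteda's; the object names are ours):

```r
# A corpus keeps the texts together with their document-level variables
corp <- corpus(courses)

# Tokenize into words, ignoring punctuation and digits
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
```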
Corpus consisting of 5 documents and 29 docvars.
BC-THE-2300 :
"A STUDY OF DRAMATIC LITERATURE FROM THE TIME OF THE EARLY GR..."
BC-JST-1500 :
"A SURVEY OF JEWISH CULTURE (JST1500) IS AN EXAMINATION OF JE..."
BC-LEI-1700 :
"AN OVERVIEW OF THE CHARACTERISTICS AND NEEDS OF MEMBERS OF S..."
BC-JOU-2200 :
"COURSE PROVIDES INSTRUCTION AND PRACTICAL EXPERIENCE IN COPY..."
BC-FRE-1121 :
"CONTINUATION OF FRE 1120. FURTHER DEVELOPMENT OF THE BASIC S..."
Tokens consisting of 5 documents and 29 docvars.
BC-THE-2300 :
[1] "A" "STUDY" "OF" "DRAMATIC" "LITERATURE"
[6] "FROM" "THE" "TIME" "OF" "THE"
[11] "EARLY" "GREEKS"
[ ... and 56 more ]
BC-JST-1500 :
[1] "A" "SURVEY" "OF" "JEWISH" "CULTURE"
[6] "JST1500" "IS" "AN" "EXAMINATION" "OF"
[11] "JEWISH" "THOUGHT"
[ ... and 18 more ]
BC-LEI-1700 :
[1] "AN" "OVERVIEW" "OF" "THE"
[5] "CHARACTERISTICS" "AND" "NEEDS" "OF"
[9] "MEMBERS" "OF" "SPECIAL" "GROUPS"
[ ... and 12 more ]
BC-JOU-2200 :
[1] "COURSE" "PROVIDES" "INSTRUCTION" "AND" "PRACTICAL"
[6] "EXPERIENCE" "IN" "COPY" "EDITING" "REWRITING"
[11] "HEADLINE" "WRITING"
[ ... and 19 more ]
BC-FRE-1121 :
[1] "CONTINUATION" "OF" "FRE" "FURTHER" "DEVELOPMENT"
[6] "OF" "THE" "BASIC" "SKILLS" "IN"
[11] "SPEAKING" "LISTENING"
[ ... and 51 more ]
3 Key-Term Searches with a Dictionary
Our primary interest lies in identifying courses related to different types of work-based learning (WBL), such as internships, apprenticeships, or practicums. For each type of work-based learning experience, we create a list of terms that we want to treat equivalently. For instance, our dictionary can specify that a course description refers to a clinical WBL experience if either of the terms “clinicals” or “clinical experience” appears.
Code
dict <-
  dictionary(list(apprenticeship = "apprentice*",
                  practicum = c("practicum", "practica"),
                  coop = c("co-op", "cooperative education", "co-operative education"),
                  clinicals = c("clinicals", "clinical experience"),
                  on_the_job = c("on the job training", "job training", "on-the-job training"),
                  wbl = c("work-based learning", "work based learning", "wbl"),
                  real_world_experience = c("real-world experience", "real world experience"),
                  service_learning = "service learning",
                  field_experience = c("fieldwork", "field experience", "field-experience")))
Armed with our dictionary, we’re ready to search for the terms using the kwic() (keyword-in-context) function. This function takes the tokens of a corpus and a dictionary as inputs, along with a window parameter specifying the number of tokens before and after a keyword that we want to see for context.
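A sketch of the search, assuming toks is the tokens object created earlier; the five-token window is our choice for illustration:

```r
# Find every dictionary match, keeping five tokens of context on each
# side; the result has one row per match, with docname, pre, keyword,
# post, and pattern (the dictionary key that matched)
keywords <- kwic(toks, pattern = dict, window = 5)
```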
Finally, we can join the results from the dictionary-based keyword-in-context search back to the course-level data and perform some wrangling to analyze the prevalence of work-based learning opportunities in Florida’s community colleges.
Code
courses_with_keywords <-
  keywords %>%
  left_join(courses, by = c("docname" = "doc_id")) %>%
  mutate(sentence = glue("{pre} {keyword} {post}") %>% str_trim() %>% str_to_sentence()) %>%
  separate(docname, into = c("school", "course"), sep = "-", extra = "merge") %>%
  mutate(discipline = str_remove(discipline, ".* - ")) %>%
  select(school, discipline, course, statewide_course, degree_type, course_credits, sentence, pattern) %>%
  arrange(school, pattern, discipline) %>%
  mutate(pattern = str_to_title(pattern) %>% str_replace_all("_", "-"))
Code
freq_table <-
courses_with_keywords %>%
group_by(school, pattern) %>%
count() %>%
group_by(pattern) %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~ "total"))) %>%
ungroup() %>%
pivot_wider(names_from = pattern, values_from = n) %>%
mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  slice(n(), 1:(n() - 1))  # move the total row to the top
4 Conclusion
In this post, we’ve demonstrated how to use quanteda to analyze course descriptions and identify work-based learning opportunities in community colleges. While our analysis focused on Florida, the same methods could be applied to other states or regions.
Through this analysis, we’ve gained valuable insights into the prevalence of work-based learning in Florida’s community colleges. We hope that our work can serve as a foundation for further research and policy discussions on this important topic.
The power of quanteda lies not just in its ability to handle large text data, but also in its flexibility. It allows researchers to tailor their analysis to their specific needs, whether that’s identifying key terms, comparing text documents, or exploring text patterns.
This blog post was co-authored by Manuel Alcalá Kovalski and Judah.